Construction of a Bilingual Arabic-Spanish Lexicon of Verbs Based on a Parallel Corpus

Authors

  • Doaa Samy
  • Antonio Moreno-Sandoval
  • José María Guirao
Abstract

Parallel corpora are considered an important resource for the development of linguistic tools. Our main goal in this paper is to develop a bilingual lexicon of verbs. Constructing this lexicon relies on two main resources: (i) a parallel corpus, exploited through alignment; and (ii) the linguistic tools already developed for Spanish, which serve as a starting point for developing tools for Arabic. Ultimately, aligned equivalent verbs are detected automatically from a Spanish-Arabic parallel corpus. To achieve this goal, we went through several preparatory stages: assessing the parallel corpus, tokenizing each monolingual corpus, performing a preliminary sentence alignment and, finally, applying the model for automatic extraction of equivalent verbs. Our method is hybrid, since it combines statistical and linguistic approaches.

1. Arabic and Corpora

In this introductory section, we highlight the state of the art in the field of Arabic corpora. The current linguistic landscape reveals an increasing interest in building Arabic corpora. Among the available Arabic corpora, there are three main written sources:

  1. The Arabic Newswire, built by the LDC at the University of Pennsylvania. It is a compilation of articles from Agence France Presse and consists of 76 million words.
  2. Articles from the Lebanese newspaper Al-Nahar, with 140 million words, available through ELRA.
  3. Articles published in the Al-Hayat newspaper, compiled by De Roeck and Goweder (2001), also available through ELRA.

As for spoken corpora, the LDC has compiled two corpora of telephone recordings of Egyptian spoken Arabic (CALLHOME and CALLFRIEND).

Parallel corpora, on the other hand, are of particular importance for multilingual language processing tools. Our survey of the state of the art in this respect found hardly any studies involving Arabic in parallel corpora.
The only work we found in this respect is that of Resnik and Smith (2003) for the STRAND project, which concerns the retrieval of parallel corpora from the Internet. For Arabic, the system was able to locate 2,190 URL pairs of English-Arabic documents.

2. The Spanish-Arabic parallel corpus

The above survey shows the absence of Arabic from the landscape of cross-lingual parallel corpora. This can be explained by the following facts: most computational and corpus-linguistic studies of Arabic have compared it mainly with English, and to a lesser extent with French. Spanish, for its part, has been studied mainly in comparison with English and with other European languages.

2.1. Building the corpus

In this section, we briefly discuss the central features of the corpus and the selection criteria. In the compilation phase, our main objective was to build a parallel corpus, that is, “a set of L1 texts and an equivalent set of L2 translations of L1” (McEnery, 1997) or, in other words, “a text which is available in two (or more) languages” (Somers, 2001).

The first task consisted of locating documents in Spanish and Arabic on the World Wide Web. The results of the first search were not satisfactory in either quantity or quality. Arabic texts with Spanish translations, and vice versa, are relatively scarce. Moreover, the quality of the translations, whether Spanish-Arabic or Arabic-Spanish, was not adequate for a linguistic study. A second round of searching produced much better results, meeting both the quantity and the quality criteria: a set of official United Nations texts was located and compiled, since both Spanish and Arabic are among the UN official languages.

2.2. Corpus characteristics

From the available UN documents, it was possible to build a parallel Spanish-Arabic corpus consisting mainly of annual reports of different UN institutions, such as the Security Council. All texts are equivalent in both languages, with a total size of about 2 million tokens. The corpus has the following features:

  1. It represents Modern Standard Arabic and Spanish as used in formal official documents.
  2. As these are UN documents, the quality of translation is guaranteed.
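
The abstract describes the final stage as automatic extraction of equivalent verbs from sentence-aligned text, combining statistical and linguistic information, but this excerpt does not say which statistic is used. As a rough illustration only, the sketch below assumes that the linguistic tools have already reduced each aligned sentence pair to two lists of verb lemmas, and it scores candidate pairs with the Dice coefficient over sentence-level co-occurrence. The Dice measure, the function name `dice_verb_pairs`, the `min_score` threshold and the toy data (with transliterated Arabic lemmas) are all assumptions made for this example, not the authors' model.

```python
from collections import Counter
from itertools import product


def dice_verb_pairs(aligned_pairs, min_score=0.5):
    """Score candidate verb translations with the Dice coefficient.

    aligned_pairs: iterable of (spanish_verbs, arabic_verbs) tuples, one per
    aligned sentence pair, each element being a list of verb lemmas.
    Returns a dict {(es, ar): score} for pairs scoring at least min_score.
    """
    es_count, ar_count, pair_count = Counter(), Counter(), Counter()
    for es_verbs, ar_verbs in aligned_pairs:
        # Count each verb at most once per sentence pair.
        es_set, ar_set = set(es_verbs), set(ar_verbs)
        es_count.update(es_set)
        ar_count.update(ar_set)
        # Every Spanish verb co-occurs with every Arabic verb in the pair.
        pair_count.update(product(es_set, ar_set))

    scores = {}
    for (es, ar), both in pair_count.items():
        # Dice = 2 * joint frequency / (sum of the individual frequencies).
        score = 2 * both / (es_count[es] + ar_count[ar])
        if score >= min_score:
            scores[(es, ar)] = score
    return scores


# Hypothetical toy data, not taken from the UN corpus described above.
toy_corpus = [
    (["decidir"], ["qarrara"]),
    (["decidir", "aprobar"], ["qarrara", "wafaqa"]),
    (["aprobar"], ["wafaqa"]),
    (["examinar"], ["nazara"]),
]
lexicon = dice_verb_pairs(toy_corpus, min_score=0.8)
```

On this toy input, the correct pairings co-occur in every sentence where either verb appears, so they pass the 0.8 threshold, while the accidental pairings from the two-verb sentence score lower and are dropped. A real system would still need the tokenization and sentence alignment stages listed in the abstract before a score like this could be computed.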

Similar resources

Cultural Influence on the Expression of Cathartic Conceptualization in English and Spanish: A Corpus-Based Analysis

This paper investigates the conceptualization of emotional release from a cognitive linguistics perspective (Cognitive Metaphor Theory). The metaphor "weeping is a means of liberating contained emotions" is grounded in universal embodied cognition and is reflected in linguistic expressions in English and Spanish. Lexicalization patterns which encapsulate this conceptualization i...

Bilingual lexicon extraction for a distant language pair using a small parallel corpus

The aim of this thesis proposal is to perform bilingual lexicon extraction for cases in which only small parallel corpora are available and it is not easy to obtain a monolingual corpus for at least one of the languages. Moreover, the languages are typologically distant and there is no bilingual seed lexicon available. We focus on the language pair Spanish-Nahuatl; we propose to work with morpheme bas...

A Linguistically Grounded Graph Model for Bilingual Lexicon Extraction

We present a new method, based on graph theory, for bilingual lexicon extraction without relying on resources with limited availability like parallel corpora. The graphs we use represent linguistic relations between words such as adjectival modification. We experiment with a number of ways of combining different linguistic relations and present a novel method, multi-edge extraction (MEE), that ...

Bilingual Lexicon Generation Using Non-Aligned Signatures

Bilingual lexicons are fundamental resources. Modern automated lexicon generation methods usually require parallel corpora, which are not available for most language pairs. Lexicons can be generated using non-parallel corpora or a pivot language, but such lexicons are noisy. We present an algorithm for generating a high quality lexicon from a noisy one, which only requires an independent corpus...

Bilingual Lexicon Induction: Effortless Evaluation of Word Alignment Tools and Production of Resources for Improbable Language Pairs

In this paper, we present a simple protocol to evaluate word aligners on bilingual lexicon induction tasks from parallel corpora. Rather than resorting to gold standards, it relies on a comparison of the outputs of word aligners against a reference bilingual lexicon. The quality of this reference bilingual lexicon does not need to be particularly high, because evaluation quality is ensured by s...

Publication date: 2004